E-Commerce: Product Range Analysis

We have data on orders from our store for about a year. We want to perform a product range analysis as well as some exploratory data analysis to learn more about our users' behaviour.

Goals of the Project

These are the questions that we need to answer:

Content

Data Preprocessing

Basic Preprocessing

Importing the Data

We have 541,909 entries in our dataset, across 7 columns. The columns contain the invoice number, the stock code of the item, a description, the number of items that were ordered, the timestamp of the order, the unit price and the customer id.

We can already see that we will need to change some datatypes (timestamp to datetime, for example) and convert the descriptions to lowercase to make sure that everything is written the same way.

The mean price of our items is 4.61, but since we also have negative values (probably from returned items) this value isn't reliable.

We also seem to have some missing descriptions and orders without customer ids; we will tackle these problems in the next steps.

The most popular item is "WHITE HANGING HEART T-LIGHT HOLDER"; it was listed 2,369 times.


back to content - back to basic preprocessing

Preparing the dataframe

We changed the column names, converted the timestamp into a datetime object and made sure that all strings are written in only lowercase letters.
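The preparation step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code; the column names (`invoice_no`, `description`, `invoice_date`) are assumptions based on the text.

```python
import pandas as pd

# Toy data standing in for the real export; column names are assumptions
df = pd.DataFrame({
    'InvoiceNo': ['536365', '536366'],
    'Description': ['WHITE HANGING HEART T-LIGHT HOLDER', 'RED WOOLLY HOTTIE'],
    'InvoiceDate': ['12/1/2018 8:26', '12/1/2018 8:28'],
})

# snake_case column names, parse timestamps, lowercase all descriptions
df.columns = ['invoice_no', 'description', 'invoice_date']
df['invoice_date'] = pd.to_datetime(df['invoice_date'])
df['description'] = df['description'].str.lower()
```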


back to content - back to basic preprocessing

Missing Values

Let's find out in which columns we have missing values and try to fill them, or decide to drop them.

There are 1,454 missing values in the description column, which equals 0.27%. We will try to fill them, but since the number is so low it probably wouldn't affect the outcome of our analysis much if we dropped them.

For customer_id, on the other hand, there are a lot more missing values: in 135,080 rows, or 33.2% of our entries, we don't have a customer id. There could be several reasons for this: customers might be able to place an order without registering on our site, or the "orders" are not real orders but internal positions like fees or inventory adjustments.

We found 1,454 missing descriptions. For all of these orders the customer_id is missing as well, and there are no other items with the same order_id, so the information can't be restored. We could fill the description based on the stock code, but we still wouldn't know which customer the order belongs to, so we dropped those rows entirely.

Unfortunately we couldn't find any positions with the same order ids, so there is no way to logically fill the missing customer ids. Since the number of affected rows is very high and there could be valid reasons for the gaps, we filled them with "Unknown". This way we can still take them into account.
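The two decisions above (drop unrecoverable descriptions, keep orders without a customer id but mark them) might look like this in pandas. A minimal sketch with toy data; column names are assumptions.

```python
import pandas as pd
import numpy as np

# Toy data: one row with a missing description, two with missing customer ids
df = pd.DataFrame({
    'description': ['mug', None, 'candle'],
    'customer_id': [np.nan, np.nan, 17850.0],
})

# Rows with no description (and no way to restore it) are dropped
df = df.dropna(subset=['description'])

# Orders without a customer id are kept, but marked explicitly
df['customer_id'] = df['customer_id'].fillna('Unknown')
```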


back to content - back to basic preprocessing

Duplicates

There is no obvious explanation for the duplicates: if they were part of the same order, the items should have been summed up in one row. Since they make up less than 1% of our data we dropped them altogether.
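Dropping full-row duplicates is a one-liner in pandas; this sketch uses toy data with assumed column names.

```python
import pandas as pd

# Toy data: the first two rows are exact duplicates
df = pd.DataFrame({
    'invoice_no': ['536365', '536365', '536366'],
    'stock_code': ['85123A', '85123A', '71053'],
    'quantity': [6, 6, 8],
})

# Identical rows can't be explained (same-order items should have been
# summed into one row), so we drop them
df = df.drop_duplicates()
```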


back to content - back to basic preprocessing

Adding information

We added columns for date and month and calculated the total price for each row, based on quantity and unit price.
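The derived columns can be sketched like this (hypothetical column names, toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'invoice_date': pd.to_datetime(['2019-05-03 10:15', '2019-11-20 16:40']),
    'quantity': [6, 2],
    'unit_price': [2.55, 4.25],
})

# Date and month for grouping, total price per row for revenue calculations
df['date'] = df['invoice_date'].dt.date
df['month'] = df['invoice_date'].dt.to_period('M')
df['total_price'] = df['quantity'] * df['unit_price']
```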


back to content - back to basic preprocessing

Additional Preprocessing


back to content

Outliers in quantity and unit_price

When we looked at the general information on our data we already saw that there are a few very high values. Even though some of our customers may be retailers, there could also be mistakes from typing in quantities or prices manually. Let's take a look.

There are two extreme outliers in quantity of around 80,000. When we only use the entries within the 99.99th percentile range, the minimum and maximum values are 1,500 (positive and negative). There are still a lot of values outside the box of the boxplot, but there are no wide gaps.

In the unit prices we had one very low negative price and one outlier with a unit price of around 40,000. Without the upper percentiles there are almost no negative values and the maximum is around 1,500. Again, there are no wide gaps, but there are still a lot of outliers that don't fall within the most common price ranges. We will keep those, since they were actual orders and need to be taken into account.

Before we remove the outliers let's check the actual values.

Even though we chose a very conservative percentile that keeps 99.99% of our data, there are outliers with much higher values. For the quantity the maximum value is 80,995, while only 0.01% of the entries have a value of more than 1,440.

For the unit price we have a similar outcome: the highest unit price is 38,970, while only 0.01% of the unit prices are higher than 2,234.

Now let's remove those outliers.
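Percentile-based trimming as described above might look like this. A sketch with toy data; with the real data the thresholds would come from the 99.99th percentile, and the lower tail (returns with negative quantities) would need the same treatment.

```python
import pandas as pd

# Toy data containing two extreme outliers, one in each column
df = pd.DataFrame({
    'quantity':   [6, 12, 80995, 4, 2],
    'unit_price': [2.55, 1.25, 2.08, 38970.0, 4.95],
})

# Keep only rows at or below the 99.99th percentile of each column
q_hi = df['quantity'].quantile(0.9999)
p_hi = df['unit_price'].quantile(0.9999)
df = df[(df['quantity'] <= q_hi) & (df['unit_price'] <= p_hi)]
```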

We are left with 535,096 entries. Dropping the rows with extreme unit prices didn't make a difference; they were probably the same rows as the ones with abnormally high quantities. This confirms that those were probably mistakes and we were right to remove them.


back to content - back to additional preprocessing

Drop rows that include 'adjustment'/'manual'/'damage'/'fee' type entries

We will remove postage, discount, dotcom postage, manual, samples, amazon fee, adjust bad debt and cruk commission from our data. These seem to be internal positions rather than products. Customers do pay postage fees, but postage is not an actual product that they order, so these entries are not relevant to the questions that we want to answer.
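Filtering these service entries out by description could be sketched as follows; the list comes from the text, the column name is an assumption.

```python
import pandas as pd

# Service/adjustment descriptions to remove, as listed in the text
service = ['postage', 'discount', 'dotcom postage', 'manual', 'samples',
           'amazon fee', 'adjust bad debt', 'cruk commission']

df = pd.DataFrame({'description': ['white hanging heart t-light holder',
                                   'postage',
                                   'manual',
                                   'jumbo bag red retrospot']})

# Keep only rows whose description is not a service entry
df = df[~df['description'].isin(service)]
```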

Removing these entries lost us another few hundred rows, but we still have 531,249 entries to work with, and we don't have to worry about manual adjustments and mistakes. The only thing left should be discounts and returns, which we will keep for the calculation of revenues and check sums.


back to content - back to additional preprocessing

Processing returns / negative quantities

We might have some orders that were returned/canceled, where we don't have the information on the original order. In this case the first order_id for a user would be a cancellation order with negative quantities. We will remove them from our dataset.
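One way to implement this removal is sketched below: find the earliest order per customer and drop it if its quantity is negative. Toy data, hypothetical column names.

```python
import pandas as pd

# Toy data: customer 1's first order is a cancellation (negative quantity)
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2],
    'order_id':    ['C1', 'A2', 'A3', 'C4'],
    'quantity':    [-5, 10, 4, -4],
    'invoice_date': pd.to_datetime(['2019-01-01', '2019-02-01',
                                    '2019-01-05', '2019-03-01']),
})

# Earliest order per customer
df = df.sort_values('invoice_date')
first_order = df.groupby('customer_id')['order_id'].first()

# Drop the first order only if it is a return (negative quantity);
# later returns like 'C4' are handled in the matching step
is_first_return = df['order_id'].isin(first_order) & (df['quantity'] < 0)
df = df[~is_first_return]
```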

We removed a few hundred entries and are still left with 529,271 entries. Now we are sure that the first entry for each customer is an actual order and not a return.

Since we can't tell directly which return belongs to which order, let's see if we can find the order that matches each return. This won't be 100% accurate, since sometimes more than one earlier order could contain the returned item.

Let's check if this worked or if there are returns left.

These are probably returns from customers whose first two or three order_ids were return orders. We only removed the first one if it had a negative quantity value, so we will drop these as well.

Now we want to eliminate items that were returned. To do this we will group the same items in the same order together and sum up their quantities. So if 10 items were purchased and 5 were returned, our new dataframe will have one row with 5 items for this order.
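Netting out returns by grouping, as described above, can be sketched like this (toy data, assumed column names):

```python
import pandas as pd

# Toy data: 10 units bought and 5 returned of the same item in order A1
df = pd.DataFrame({
    'order_id':   ['A1', 'A1', 'A1'],
    'stock_code': ['85123A', '85123A', '71053'],
    'quantity':   [10, -5, 3],
})

# Sum quantities of the same item within the same order so purchases
# and returns cancel out, then keep only net-positive rows
df = (df.groupby(['order_id', 'stock_code'], as_index=False)['quantity']
        .sum())
df = df[df['quantity'] > 0]
```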

We removed a little over 10,000 entries this way. At this point our dataframe should only contain items that were purchased and not returned or cancelled, as well as discounts (which we can't assign to a certain item, but to the total check for the order).


back to content - back to additional preprocessing

Sorting items into categories

We need to sort our unique items into a few categories so that we can analyse which categories are more popular, how much revenue they generate, etc. First we should make sure that any typos or differing descriptions are cleaned up. To do this we will keep only one description per stock code and use it for every item with the same code.
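Keeping one canonical description per stock code might look like this (a sketch with toy data and assumed column names):

```python
import pandas as pd

# Toy data: the same stock code appears with two description variants
df = pd.DataFrame({
    'stock_code':  ['85123A', '85123A', '71053'],
    'description': ['white hanging heart t-light holder',
                    'white hanging heart tlight holder',   # typo variant
                    'white metal lantern'],
})

# Keep the first description per stock code and apply it everywhere
canonical = df.groupby('stock_code')['description'].first()
df['description'] = df['stock_code'].map(canonical)
```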

There were around 200 items that had the same stock code but a different description. We cleaned the descriptions, and now we know that we have 3,992 unique products.

Now let's define our categories and add some buzzwords to them. To find these I looked at almost all descriptions and decided on the category manually; in real life this information would be available. Since the description isn't always clear about the type of item, I found that using lemmatization or TextBlob wouldn't spare me the work, so I didn't use it in this notebook.
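A buzzword-based categorisation like the one described can be sketched as below. The category names come from the text; the keyword lists and the matching logic are illustrative assumptions, not the notebook's actual dictionary.

```python
import pandas as pd

# Hypothetical buzzword mapping; the real dictionary was built by hand
buzzwords = {
    'Candles & Lamps': ['candle', 'lantern', 't-light'],
    'Kitchen':         ['mug', 'teacup', 'baking'],
    'Living':          ['cushion', 'ornament', 'frame'],
}

def categorise(description):
    """Return the first category whose buzzword appears in the description."""
    for category, words in buzzwords.items():
        if any(w in description for w in words):
            return category
    return 'Other'

descriptions = pd.Series(['white hanging heart t-light holder',
                          'assorted colour bird ornament',
                          'jumbo bag red retrospot'])
categories = descriptions.apply(categorise)
```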

We successfully sorted all of our products into a few categories. Most items that were sold are from the categories Living, DIY & Stationary and Kitchen, but we will go into this deeper in the upcoming analysis. Now that we're done with categorising, we are ready to explore our customers' behaviour more in depth.


back to content - back to additional preprocessing

Conclusion on Data Preprocessing


back to content - back to data preprocessing

EDA

We will check some general metrics to get to know the data and our customers:


back to content

Orders per customer

We have a lot of orders without customer_ids. We won't use those for the number of customers or the average number of orders, since those values wouldn't be correct with them included. But we can use them for the average check, since we do have an order_id for those entries as well.

The majority of our customers ordered only a few times in the given period: 1,503 ordered only once, and from there the count of customers steadily declines. Very few customers ordered more than 40 times; those customers are probably retailers that frequently restock their stores.


back to content - back to EDA

Average order size

On average our customers spend only up to 500 on one order. There are a few very costly orders, some for more than 20,000, but it is very unusual to find an order with a total price of more than 2,000.


back to content - back to EDA

Total sales by category

We already found that most items we sold were from the category Living, and we also had the highest revenue from this category: 39% of the sales were in this category, a total of 3.8 million. Surprisingly, "DIY & Stationary" items were the second most often sold, but they are only in fourth place when it comes to sales; only 9.95% of our revenue came from this category (0.9 million).

The categories with the second and third highest sales are Kitchen and Living, while Household items, Travel & Garden and Toys are the categories with the least revenue (less than 3% each).


back to content - back to EDA

Monthly sales

The category 'Living' was the best one for all months. But we saw a steady increase in revenue from May 2019 to November 2019.

In general, sales went up from May 2019 and reached an all-time high in November 2019. Since we only have data for one year, we can't tell if this is a seasonal pattern (the months leading up to Christmas see higher revenue) or if our sales went up in general, maybe because of adding interesting items to our portfolio or marketing campaigns that attracted new customers.

The share that each category has in our sales stays more or less the same all year, except for Living, DIY & Stationary and Holidays, which all increased from August to November 2019.


back to content - back to EDA

Average order size per month

The median order size is more or less consistent throughout the year. We had a small decrease in June 2019 and an increase in September 2019, but overall the median order size is between 270 and 300.

The mean order size changed a bit more: it was the lowest in April 2019 with only 400, and the highest in September 2019 with more than 550. We obviously had some very small orders in April and maybe a few very large ones in September.


back to content - back to EDA

Average check by customer

Most orders are less than 2,000 in total. There are only a few with total prices of up to 12,000 (even though we cleaned items that had a very high quantity or unit price).

Not many orders have a total price of more than 500; from there up to 2,000 the number of orders decreases steadily.


back to content - back to EDA

Orders per month

The number of orders slowly increases, but there is a lot of fluctuation. From December 2018 to April 2019 the number is between 1,000 and 1,500 orders per month. In May 2019 we saw a small peak with almost 1,750 orders, and in October and November the number of orders went up very quickly, to 2,750 orders. This was probably also related to the Christmas season.


back to content - back to EDA

Most and least sold products

The most sold item is the "Jumbo bag red retrospot", with 47,603 items sold in 2,085 orders. It is a bit more expensive than other products from the category Fashion & Accessoires, but still very popular.

Other popular items are usually cheaper than the average unit price of their category. For example, the "Assorted Colour Bird Ornament" only costs 1.72, while other products in the category Living have an average unit price of 4.19.

The least popular products are usually a bit more expensive than the average unit price of their category.

In this visualisation we can see that, besides the jumbo bag red retrospot, the white hanging heart t-light holder had the highest combination of quantity and number of different orders it appeared in. In this graph we find some products in the top right corner that weren't in our list of the top 10 most sold items, because there we only looked at the number of sold items. Here we can see that some items were ordered in bulk by the same customers, while others were popular with more people.


back to content - back to EDA

Conclusion on EDA


back to content - back to EDA

Analysis


back to content

Product Life Cycle

In most cases, a product moves through three stages – Introduction, Maturity and Decline. We will find the first and last month each product was ordered, the month with the highest revenue from it, and try to calculate how long a product usually stays interesting for our customers, or whether some items don't sell at all anymore.
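The lifespan calculation described above might look like this. A sketch with toy data and assumed column names; lifespan here is simply the span in months between a product's first and last order.

```python
import pandas as pd

# Toy data: three products with different active spans
df = pd.DataFrame({
    'stock_code': ['85123A', '85123A', '22423', '22423', '21212'],
    'invoice_date': pd.to_datetime(['2018-12-05', '2019-11-20',
                                    '2019-03-01', '2019-04-15',
                                    '2019-06-01']),
})

df['month'] = df['invoice_date'].dt.to_period('M')

# First and last month each product was ordered, and the span between them
life = df.groupby('stock_code')['month'].agg(['min', 'max'])
life['lifespan_months'] = (life['max'] - life['min']).apply(lambda d: d.n)
```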

Products from the categories Toys, Household and Other usually have the longest lifespans, and very few products there have short lifespans. These categories contain timeless items that are interesting for our customers throughout the whole year.

The Holiday category, on the other hand, contains many products with short lifespans; they only sell during a few months of the year. Since we only have data for one year, these products could of course become interesting again for the next Christmas or Easter, but they don't sell well in every month.

There are 128 products that were only sold on one day, so they have a lifespan of 0.


back to content - back to Analysis

Main and additional assortment

We want to divide our products into main and additional assortment. To do this we will find items that were sold without any other products, meaning that the customer placed an order specifically to buy this item. All other items will be considered additional.
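The split can be sketched as below: a product is "main" if it was at least once the only item in an order. Toy data, assumed column names.

```python
import pandas as pd

# Toy data: order A1 contains a single product, A2 and A3 contain two each
df = pd.DataFrame({
    'order_id':   ['A1', 'A2', 'A2', 'A3', 'A3'],
    'stock_code': ['85123A', '85123A', '71053', '71053', '22960'],
})

# Number of distinct products in each row's order
items_per_order = df.groupby('order_id')['stock_code'].transform('nunique')

# "Main" products: at least once the only item in an order
main_products = set(df.loc[items_per_order == 1, 'stock_code'])
df['assortment'] = df['stock_code'].map(
    lambda code: 'main' if code in main_products else 'additional')
```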

There are 1,462 products that were at least once the only item in an order.

Only 18.9% of our products are considered "main" assortment. The vast majority belongs to the additional assortment, meaning that those products were never sold by themselves.


back to content - back to Analysis

Sold items by category and assortment

For the categories Kitchen, Other and Travel & Garden there is a balance between the quantity of sold items from the main and the additional assortment. For Living, DIY & Stationary, Holidays, Household and Toys more sold items are from the additional assortment. Only for Candles & Lamps and Fashion & Accessoires is the quantity of products from the main assortment higher than from the additional assortment.


back to content - back to Analysis - back to main and additional assortment

Assortment split in categories

The categories Kitchen, Candles & Lamps and Toys have the highest share of products from the main assortment (more than 20% in each category). By our calculation this means that customers place orders just to buy a certain product from one of these categories.

Household and Fashion & Accessoires products are more often from the additional assortment, with a main share of 15% or less. Customers buy something else and add these products to their cart, but they are not the main reason for the order.


back to content - back to Analysis - back to main and additional assortment

Assortment Revenue

Our assignment of "main" and "additional" seems to make sense. Even though there are a lot more products in the additional assortment, more than half of our revenue comes from the main assortment (56.7%, or 5.56 million).


back to content - back to Analysis - back to main and additional assortment

Product Bundles

In this step we want to find products that were often sold together. We will not use returned items or discounts for this calculation, and we limit the bundle size to 2 products.
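Counting co-occurring product pairs per order can be sketched like this (toy data, assumed column names):

```python
from itertools import combinations
from collections import Counter

import pandas as pd

# Toy data: two orders, two of them sharing the same pair of bags
df = pd.DataFrame({
    'order_id':    ['A1', 'A1', 'A2', 'A2', 'A2'],
    'description': ['jumbo bag red retrospot', 'jumbo bag pink polkadot',
                    'jumbo bag red retrospot', 'jumbo bag pink polkadot',
                    'party bunting'],
})

# Count every unordered pair of distinct products within each order
pairs = Counter()
for _, items in df.groupby('order_id')['description']:
    for pair in combinations(sorted(set(items)), 2):
        pairs[pair] += 1

top_bundle = pairs.most_common(1)[0]
```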

The most popular bundle is "Jumbo bag pink polkadot" combined with "jumbo bag red retrospot"; they were bought together 825 times. In general, a lot of these bags are in our top 15 bundles.

We already established that a lot of our customers probably are retailers, so it makes sense that they might buy different varieties of the same product, like different designs and colors.


back to content - back to Analysis

Conclusion on Analysis


back to content - back to Analysis

Hypothesis Testing

We will run a statistical test to find out if our hypothesis is true:

Candles are sold more often in the Winter, because they make your home cozy and comfortable
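One way to test this hypothesis is a two-proportion z-test on the share of orders containing candles, sketched below. The notebook's actual test may differ, and the order counts here are invented toy numbers chosen only to match the 56% and 51.7% shares quoted in the results.

```python
from math import sqrt, erf

# Toy counts (assumptions) reproducing the shares from the text
winter_orders, winter_with_candles = 5000, 2800    # 56.0%
other_orders,  other_with_candles  = 15000, 7755   # 51.7%

p1 = winter_with_candles / winter_orders
p2 = other_with_candles / other_orders
p_pool = (winter_with_candles + other_with_candles) / (winter_orders + other_orders)

# Pooled standard error and z statistic for the difference in proportions
se = sqrt(p_pool * (1 - p_pool) * (1 / winter_orders + 1 / other_orders))
z = (p1 - p2) / se

# Two-sided p-value from the normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
significant = p_value < 0.05
```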

Just as we suspected, there is a statistically significant difference between winter and the other seasons in the share of orders that contained candles. In winter 56% of all orders contained candles or lamps, in the other seasons 51.7%. This doesn't look like a big difference, but our test tells us that the difference between both groups is statistically significant.


back to content

General Conclusion

Key insights and recommendations


Link to the dashboard:
https://public.tableau.com/app/profile/laura.str.mpler/viz/p100eCommerce/SalesDashboard


Link to the presentation:
https://drive.google.com/file/d/1HjEduSYEL1HNWY9QkuBID2RqzwIqmmHg/view?usp=sharing


back to content
